First we’ll download the necessary packages, libraries and data.

library(arules)
library(dplyr)
library(arulesViz)

options(scipen = 999)

Associative Rule Mining of Boston Dataset

data(Boston, package = "MASS")

Reviewing the data

We can look at the class of each variable in the Boston data:

lapply(Boston, class)
## $crim
## [1] "numeric"
## 
## $zn
## [1] "numeric"
## 
## $indus
## [1] "numeric"
## 
## $chas
## [1] "integer"
## 
## $nox
## [1] "numeric"
## 
## $rm
## [1] "numeric"
## 
## $age
## [1] "numeric"
## 
## $dis
## [1] "numeric"
## 
## $rad
## [1] "integer"
## 
## $tax
## [1] "numeric"
## 
## $ptratio
## [1] "numeric"
## 
## $black
## [1] "numeric"
## 
## $lstat
## [1] "numeric"
## 
## $medv
## [1] "numeric"

Most variables are numeric; the exceptions are chas and rad, which we’ll review and convert to factors.

unique(Boston$chas)
## [1] 0 1
unique(Boston$rad)
## [1]  1  2  3  5  4  8  6  7 24

Processing

We’ll create a copy of the dataset and apply the new variable classes.

b <- Boston
b$chas <- factor(Boston$chas, labels = c("river", "noriver"))
b$rad <- factor(Boston$rad)

The variable b$black will be cut into bins for easier interpretation.

b$black <- cut(Boston$black, breaks = 4, labels = c(">31.5%", "18.5-31.5%", "8-18.5%", "<8%"))

Now we can discretize all the remaining variables in dataset b by putting them into 4 equal-width bins.

We then pull out chas, rad, and black, mutate the remaining numeric variables into the 4 equal-width bins, and bind chas, rad, and black back onto the dataset.

Last, we’ll turn the dataset b into a transactional dataset.

discrt <- function(x) cut(x, breaks = 4, labels = c("low", "medlow", "medhigh", "High"))

b <- select(b, -c("chas", "rad", "black")) %>%
  mutate_all(discrt) %>%
  bind_cols(select(b, c("chas", "rad", "black")))

dim(b)
## [1] 506  14
summary(b)
##       crim           zn          indus          nox            rm     
##  low    :491   low    :429   low    :202   low    :200   low    :  8  
##  medlow : 10   medlow : 32   medlow :112   medlow :182   medlow :234  
##  medhigh:  2   medhigh: 16   medhigh:165   medhigh:100   medhigh:236  
##  High   :  3   High   : 29   High   : 27   High   : 24   High   : 28  
##                                                                       
##                                                                       
##                                                                       
##       age           dis           tax         ptratio        lstat    
##  low    : 51   low    :305   low    :240   low    : 58   low    :243  
##  medlow : 97   medlow :144   medlow :128   medlow : 68   medlow :187  
##  medhigh: 96   medhigh: 52   medhigh:  1   medhigh:171   medhigh: 57  
##  High   :262   High   :  5   High   :137   High   :209   High   : 19  
##                                                                       
##                                                                       
##                                                                       
##       medv          chas          rad             black    
##  low    :116   river  :471   24     :132   >31.5%    : 31  
##  medlow :284   noriver: 35   5      :115   18.5-31.5%:  8  
##  medhigh: 74                 4      :110   8-18.5%   : 15  
##  High   : 32                 3      : 38   <8%       :452  
##                              6      : 26                   
##                              2      : 24                   
##                              (Other): 61
b <- as(b, "transactions")

Check the columns to make sure they’re all in the bins:

colnames(b)
##  [1] "crim=low"         "crim=medlow"      "crim=medhigh"    
##  [4] "crim=High"        "zn=low"           "zn=medlow"       
##  [7] "zn=medhigh"       "zn=High"          "indus=low"       
## [10] "indus=medlow"     "indus=medhigh"    "indus=High"      
## [13] "nox=low"          "nox=medlow"       "nox=medhigh"     
## [16] "nox=High"         "rm=low"           "rm=medlow"       
## [19] "rm=medhigh"       "rm=High"          "age=low"         
## [22] "age=medlow"       "age=medhigh"      "age=High"        
## [25] "dis=low"          "dis=medlow"       "dis=medhigh"     
## [28] "dis=High"         "tax=low"          "tax=medlow"      
## [31] "tax=medhigh"      "tax=High"         "ptratio=low"     
## [34] "ptratio=medlow"   "ptratio=medhigh"  "ptratio=High"    
## [37] "lstat=low"        "lstat=medlow"     "lstat=medhigh"   
## [40] "lstat=High"       "medv=low"         "medv=medlow"     
## [43] "medv=medhigh"     "medv=High"        "chas=river"      
## [46] "chas=noriver"     "rad=1"            "rad=2"           
## [49] "rad=3"            "rad=4"            "rad=5"           
## [52] "rad=6"            "rad=7"            "rad=8"           
## [55] "rad=24"           "black=>31.5%"     "black=18.5-31.5%"
## [58] "black=8-18.5%"    "black=<8%"

Get a summary of the newly discretized and cut dataset.

summary(b)
## transactions as itemMatrix in sparse format with
##  506 rows (elements/itemsets/transactions) and
##  59 columns (items) and a density of 0.2372881 
## 
## most frequent items:
##   crim=low chas=river  black=<8%     zn=low    dis=low    (Other) 
##        491        471        452        429        305       4936 
## 
## element (itemset/transaction) length distribution:
## sizes
##  14 
## 506 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      14      14      14      14      14      14 
## 
## includes extended item information - examples:
##         labels variables  levels
## 1     crim=low      crim     low
## 2  crim=medlow      crim  medlow
## 3 crim=medhigh      crim medhigh
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2             2
## 3             3

Now we’ll inspect the first 3 transactions.

inspect(b[1:3])
##     items             transactionID
## [1] {crim=low,                     
##      zn=low,                       
##      indus=low,                    
##      nox=medlow,                   
##      rm=medhigh,                   
##      age=medhigh,                  
##      dis=medlow,                   
##      tax=low,                      
##      ptratio=medlow,               
##      lstat=low,                    
##      medv=medlow,                  
##      chas=river,                   
##      rad=1,                        
##      black=<8%}                   1
## [2] {crim=low,                     
##      zn=low,                       
##      indus=low,                    
##      nox=low,                      
##      rm=medhigh,                   
##      age=High,                     
##      dis=medlow,                   
##      tax=low,                      
##      ptratio=medhigh,              
##      lstat=low,                    
##      medv=medlow,                  
##      chas=river,                   
##      rad=2,                        
##      black=<8%}                   2
## [3] {crim=low,                     
##      zn=low,                       
##      indus=low,                    
##      nox=low,                      
##      rm=medhigh,                   
##      age=medhigh,                  
##      dis=medlow,                   
##      tax=low,                      
##      ptratio=medhigh,              
##      lstat=low,                    
##      medv=medhigh,                 
##      chas=river,                   
##      rad=2,                        
##      black=<8%}                   3

We can plot the frequency of the binned items (those with at least 30% support):

itemFrequencyPlot(b, support=.3, cex.names=.8)

Using association rules

Now we’re going to apply the apriori algorithm to the b dataset with 2.5% support (0.025) and 75% confidence. This gives an absolute minimum support count of 12, and the algorithm checks subsets up to the maximum rule length of 10.

ars <- apriori(b, parameter = list(support=.025, confidence=.75))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.75    0.1    1 none FALSE            TRUE       5   0.025      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 12 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[59 item(s), 506 transaction(s)] done [0.00s].
## sorting and recoding items ... [52 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.03s].
## writing ... [408638 rule(s)] done [0.10s].
## creating S4 object  ... done [0.17s].
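The “Absolute minimum support count: 12” line appears to be the relative support multiplied by the number of transactions and truncated to an integer; a quick sanity check of that arithmetic (not part of the mining output):

```r
# Relative support times the number of transactions, truncated,
# matches the count apriori reports in its log.
floor(0.025 * 506)
## [1] 12
```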

We can get a summary of the rule set, including the distribution of rule lengths (lhs + rhs items) and the quality measures of the rules that satisfy our support and confidence constraints.

summary(ars)
## set of 408638 rules
## 
## rule length distribution (lhs + rhs):sizes
##      1      2      3      4      5      6      7      8      9     10 
##      4    293   3650  18932  53620  92554 103550  78411  41677  15947 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   6.000   7.000   6.846   8.000  10.000 
## 
## summary of quality measures:
##     support          confidence          lift             count       
##  Min.   :0.02569   Min.   :0.7500   Min.   : 0.7799   Min.   : 13.00  
##  1st Qu.:0.02964   1st Qu.:0.9189   1st Qu.: 1.0743   1st Qu.: 15.00  
##  Median :0.03755   Median :1.0000   Median : 1.6590   Median : 19.00  
##  Mean   :0.04857   Mean   :0.9517   Mean   : 1.9759   Mean   : 24.58  
##  3rd Qu.:0.05534   3rd Qu.:1.0000   3rd Qu.: 2.4211   3rd Qu.: 28.00  
##  Max.   :0.97036   Max.   :1.0000   Max.   :19.4615   Max.   :491.00  
## 
## mining info:
##  data ntransactions support confidence
##     b           506   0.025       0.75

As the example notes, we are interested in the association between pollution (nox) and property value (medv), so we’ll find the top 5 rules by confidence with “medv=High” on the rhs:

inspect(head(subset(ars, subset=rhs %in% "medv=High"), 5, by="confidence"))
##     lhs              rhs            support confidence    lift count
## [1] {rm=High,                                                       
##      ptratio=low} => {medv=High} 0.02964427          1 15.8125    15
## [2] {rm=High,                                                       
##      ptratio=low,                                                   
##      lstat=low}   => {medv=High} 0.02964427          1 15.8125    15
## [3] {rm=High,                                                       
##      ptratio=low,                                                   
##      black=<8%}   => {medv=High} 0.02964427          1 15.8125    15
## [4] {crim=low,                                                      
##      rm=High,                                                       
##      ptratio=low} => {medv=High} 0.02964427          1 15.8125    15
## [5] {rm=High,                                                       
##      ptratio=low,                                                   
##      lstat=low,                                                     
##      black=<8%}   => {medv=High} 0.02964427          1 15.8125    15

And here the top 5 rules by confidence with “medv=low” on the rhs:

inspect(head(subset(ars, subset=rhs %in% "medv=low"), 5, by="confidence"))
##     lhs                rhs           support confidence     lift count
## [1] {nox=medhigh,                                                     
##      lstat=medhigh} => {medv=low} 0.05928854          1 4.362069    30
## [2] {nox=medhigh,                                                     
##      lstat=medhigh,                                                   
##      rad=24}        => {medv=low} 0.05928854          1 4.362069    30
## [3] {nox=medhigh,                                                     
##      tax=High,                                                        
##      lstat=medhigh} => {medv=low} 0.05928854          1 4.362069    30
## [4] {indus=medhigh,                                                   
##      nox=medhigh,                                                     
##      lstat=medhigh} => {medv=low} 0.05928854          1 4.362069    30
## [5] {nox=medhigh,                                                     
##      ptratio=High,                                                    
##      lstat=medhigh} => {medv=low} 0.05928854          1 4.362069    30

And now we’ll look at rules with high pollution (nox=High) on either the lhs or the rhs:

inspect(head(subset(ars, subset=rhs %in% "nox=High" | lhs %in% "nox=High")))
##     lhs           rhs             support    confidence lift      count
## [1] {nox=High} => {indus=medhigh} 0.04743083 1.0000000  3.0666667 24   
## [2] {nox=High} => {rm=medlow}     0.03754941 0.7916667  1.7118946 19   
## [3] {nox=High} => {age=High}      0.04743083 1.0000000  1.9312977 24   
## [4] {nox=High} => {dis=low}       0.04743083 1.0000000  1.6590164 24   
## [5] {nox=High} => {zn=low}        0.04743083 1.0000000  1.1794872 24   
## [6] {nox=High} => {black=<8%}     0.03754941 0.7916667  0.8862463 19

Instead of ranking the medv=High rules by confidence, we’ll rank them by support:

inspect(head(subset(ars, subset=rhs %in% "medv=High"), 5, by="support"))
##     lhs                              rhs         support    confidence
## [1] {rm=High}                     => {medv=High} 0.04743083 0.8571429 
## [2] {rm=High,lstat=low}           => {medv=High} 0.04743083 0.8571429 
## [3] {rm=High,black=<8%}           => {medv=High} 0.04743083 0.8571429 
## [4] {crim=low,rm=High}            => {medv=High} 0.04743083 0.8571429 
## [5] {rm=High,lstat=low,black=<8%} => {medv=High} 0.04743083 0.8571429 
##     lift     count
## [1] 13.55357 24   
## [2] 13.55357 24   
## [3] 13.55357 24   
## [4] 13.55357 24   
## [5] 13.55357 24

Now we’ll look at rules generated from maximal and closed itemsets:

We’ll start with rules from maximal itemsets; these sit right at our support constraint (a count of 13) with confidence well above the constraint.

inspect(head(subset(ars, subset=is.maximal(ars))))
##     lhs                                       rhs         support  
## [1] {zn=low,lstat=medlow,chas=noriver}     => {crim=low}  0.0256917
## [2] {crim=low,lstat=medlow,chas=noriver}   => {zn=low}    0.0256917
## [3] {rm=medhigh,chas=noriver,black=<8%}    => {crim=low}  0.0256917
## [4] {crim=low,rm=medhigh,chas=noriver}     => {black=<8%} 0.0256917
## [5] {rm=medlow,ptratio=low,medv=medlow}    => {crim=low}  0.0256917
## [6] {indus=medhigh,age=medhigh,chas=river} => {crim=low}  0.0256917
##     confidence lift     count
## [1] 1          1.030550 13   
## [2] 1          1.179487 13   
## [3] 1          1.030550 13   
## [4] 1          1.119469 13   
## [5] 1          1.030550 13   
## [6] 1          1.030550 13

To find the closed itemsets we first need to mine the frequent itemsets; 52 items survive the support threshold:

freq.itemsets <- apriori(b, parameter = list(target="frequent itemsets", support=.025))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##          NA    0.1    1 none FALSE            TRUE       5   0.025      1
##  maxlen            target   ext
##      10 frequent itemsets FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 12 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[59 item(s), 506 transaction(s)] done [0.00s].
## sorting and recoding items ... [52 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.11s].
## writing ... [106259 set(s)] done [0.03s].
## creating S4 object  ... done [0.09s].

Now we can find and review the closed itemsets. From the length distribution below, 13 of the closed itemsets contain a single item.

inspect(head(subset(ars, subset=is.closed(freq.itemsets))))
##     lhs           rhs               support    confidence lift      count
## [1] {rad=1}    => {black=<8%}       0.03952569 1.0000000  1.1194690 20   
## [2] {nox=High} => {indus=medhigh}   0.04743083 1.0000000  3.0666667 24   
## [3] {nox=High} => {dis=low}         0.04743083 1.0000000  1.6590164 24   
## [4] {nox=High} => {zn=low}          0.04743083 1.0000000  1.1794872 24   
## [5] {nox=High} => {black=<8%}       0.03754941 0.7916667  0.8862463 19   
## [6] {rad=2}    => {ptratio=medhigh} 0.03557312 0.7500000  2.2192982 18
closed = freq.itemsets[is.closed(freq.itemsets)]
summary(closed)
## set of 11351 itemsets
## 
## most frequent items:
##   crim=low  black=<8%     zn=low chas=river    dis=low    (Other) 
##       9822       7839       7794       6833       4237      44615 
## 
## element (itemset/transaction) length distribution:sizes
##    1    2    3    4    5    6    7    8    9   10 
##   13   72  239  632 1309 2059 2346 1782  829 2070 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   6.000   7.000   7.148   9.000  10.000 
## 
## summary of quality measures:
##     support            count       
##  Min.   :0.02569   Min.   : 13.00  
##  1st Qu.:0.03360   1st Qu.: 17.00  
##  Median :0.04743   Median : 24.00  
##  Mean   :0.06915   Mean   : 34.99  
##  3rd Qu.:0.07708   3rd Qu.: 39.00  
##  Max.   :0.97036   Max.   :491.00  
## 
## includes transaction ID lists: FALSE 
## 
## mining info:
##  data ntransactions support confidence
##     b           506   0.025          1

We can compare this to the maximal itemsets, whose lengths start at 4; there are only 4 itemsets of that minimum length.

maximal = freq.itemsets[is.maximal(freq.itemsets)]
summary(maximal)
## set of 2949 itemsets
## 
## most frequent items:
##   crim=low chas=river     zn=low  black=<8%    dis=low    (Other) 
##       2401       2247       2238       2093       1755      16939 
## 
## element (itemset/transaction) length distribution:sizes
##    4    5    6    7    8    9   10 
##    4   21   57  184  295  318 2070 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.000   9.000  10.000   9.384  10.000  10.000 
## 
## summary of quality measures:
##     support            count      
##  Min.   :0.02569   Min.   :13.00  
##  1st Qu.:0.02569   1st Qu.:13.00  
##  Median :0.02964   Median :15.00  
##  Mean   :0.03450   Mean   :17.46  
##  3rd Qu.:0.03755   3rd Qu.:19.00  
##  Max.   :0.14032   Max.   :71.00  
## 
## includes transaction ID lists: FALSE 
## 
## mining info:
##  data ntransactions support confidence
##     b           506   0.025          1
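As a sanity check, the three itemset collections should be nested: every maximal itemset is closed, and every closed itemset is frequent. The counts reported above bear this out (assuming the objects from the preceding code are still in the session):

```r
# Nested collections: maximal is a subset of closed,
# which is a subset of the frequent itemsets.
length(freq.itemsets)
## [1] 106259
length(closed)
## [1] 11351
length(maximal)
## [1] 2949
```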

Now we can look at shorter rules:

inspect(head(subset(ars, subset = size(lhs) < 5 & size(lhs) > 1), 5, by="support"))
##     lhs                       rhs          support   confidence lift     
## [1] {chas=river,black=<8%} => {crim=low}   0.8083004 0.9784689  1.0083610
## [2] {crim=low,black=<8%}   => {chas=river} 0.8083004 0.9232506  0.9918573
## [3] {crim=low,chas=river}  => {black=<8%}  0.8083004 0.8969298  1.0040852
## [4] {zn=low,chas=river}    => {crim=low}   0.7569170 0.9623116  0.9917101
## [5] {crim=low,zn=low}      => {chas=river} 0.7569170 0.9251208  0.9938665
##     count
## [1] 409  
## [2] 409  
## [3] 409  
## [4] 383  
## [5] 383

And repeat the previous query, this time requiring a lift above 2.

inspect(head(subset(ars, subset = size(lhs) < 5 & size(lhs) > 1 & lift > 2), 5, by="support"))
##     lhs                             rhs         support   confidence
## [1] {nox=low,black=<8%}          => {indus=low} 0.3221344 0.8150000 
## [2] {indus=low,black=<8%}        => {nox=low}   0.3221344 0.8069307 
## [3] {crim=low,nox=low}           => {indus=low} 0.3221344 0.8150000 
## [4] {crim=low,indus=low}         => {nox=low}   0.3221344 0.8069307 
## [5] {crim=low,nox=low,black=<8%} => {indus=low} 0.3221344 0.8150000 
##     lift     count
## [1] 2.041535 163  
## [2] 2.041535 163  
## [3] 2.041535 163  
## [4] 2.041535 163  
## [5] 2.041535 163

Data Visualization

First we’ll review the interactive scatter plot of all 408638 rules.

plot(ars, engine = "htmlwidget", jitter = 0)

Now we’ll look at a grouped matrix using tighter constraints. We use this to compare support and lift across groups of lhs and rhs items.

somerules <- subset(ars, subset=size(lhs)>1 & confidence >.9 & support >0.5)
plot(somerules, method = "grouped")

This matrix with the top 21 rules shows the many antecedents that map to the 4 consequents: “{chas=river}”, “{crim=low}”, “{black=<8%}”, and “{zn=low}”:

plot(somerules, method = "matrix")
## Itemsets in Antecedent (LHS)
##  [1] "{crim=low,dis=low,chas=river}" "{crim=low,dis=low}"           
##  [3] "{dis=low,chas=river}"          "{crim=low,medv=medlow}"       
##  [5] "{medv=medlow,black=<8%}"       "{medv=medlow,chas=river}"     
##  [7] "{chas=river,black=<8%}"        "{zn=low,chas=river,black=<8%}"
##  [9] "{zn=low,black=<8%}"            "{crim=low,zn=low}"            
## [11] "{crim=low,black=<8%}"          "{zn=low,chas=river}"          
## [13] "{crim=low,zn=low,black=<8%}"   "{zn=low,dis=low}"             
## [15] "{crim=low,zn=low,dis=low}"     "{zn=low,dis=low,chas=river}"  
## Itemsets in Consequent (RHS)
## [1] "{chas=river}" "{crim=low}"   "{black=<8%}"  "{zn=low}"

Use the 4 consequents to create a network graph.

plot(somerules, method = "graph", engine = "htmlwidget")

[REQUIRED]

Mine association rules from your dataset.

Associative Rule Mining of Our Dataset

After going through this exercise, perform association rule learning on your dataset. Turn in both the R code for the exercise, and the R code for the practice using your datasets. You want to explore different thresholds, use the interactive vis tools provided by arulesViz, and find and report at least two interesting association rules from your dataset.

rm(list=ls())
load("LifeExpectancyData_3.Rdata")
data$Status = as.factor(data$Status)
data$Year = as.factor(data$Year)
data$Country = as.factor(data$Country)


data$Life <- cut(data$`Life expectancy`, 
                              breaks = 3, 
                              labels=c('low LE', 'middle LE', 'high LE'))
data$infantmort <- cut(data$`infant deaths`, 
                              breaks = 3, 
                              labels=c('low_infant', 'middle_infant', 'high_infant'))
data$expenditure <- cut(data$`Total expenditure`, 
                              breaks = 3, 
                              labels=c('low_expenditure','med_expenditure', 'high_expenditure'))
data$pop <- cut(data$Population, 
                              breaks = 3, 
                              labels=c('low_pop','med_pop', 'high_pop'))
data$bmi <- cut(data$BMI, 
                              breaks = 3, 
                              labels=c('low_BMI','med_BMI', 'high_BMI'))  
data$gdp <- cut(data$GDP, 
                              breaks = 3, 
                              labels=c('low_GDP','med_GDP', 'high_GDP'))  
data$alcohol <- cut(data$Alcohol, 
                              breaks = 3, 
                              labels=c('low_alc','med_alc', 'high_alc'))  

drops <- c("Alcohol",
           "percentage expenditure",
           "BMI",
           "GDP",
           "Diphtheria",
           "Life expectancy",
           "Measles",
           "infant deaths",
           "Total expenditure", 
           "thinness  1-19 years",  
           "thinness 5-9 years", 
           "Population", 
           "HIV/AIDS", 
           "Polio",
           "under-five deaths",
           "Adult Mortality",
           "Income composition of resources",
           "Schooling")
data <- data[ , !(names(data) %in% drops)]

Transform the data frame to a transactions dataset, where each row is described by a set of binary variables (this is the “bitmap indexing” we learned in Chapter 4 of the textbook). Transactions data are often very large and sparse, so looking at them directly won’t give you much information. You can see how the columns are constructed with colnames(), or get a summary() of the object. To see the records, use inspect(); here we inspect the first 3 transactions.

b <- as(as.data.frame(data), "transactions") 

inspect(b[1:3]) 
##     items                         transactionID
## [1] {Country=Afghanistan,                      
##      Year=2015,                                
##      Status=Developing,                        
##      region=Asia,                              
##      Life=middle LE,                           
##      infantmort=low_infant,                    
##      expenditure=med_expenditure,              
##      pop=low_pop,                              
##      bmi=low_BMI,                              
##      gdp=low_GDP,                              
##      alcohol=low_alc}                         1
## [2] {Country=Afghanistan,                      
##      Year=2014,                                
##      Status=Developing,                        
##      region=Asia,                              
##      Life=middle LE,                           
##      infantmort=low_infant,                    
##      expenditure=med_expenditure,              
##      pop=low_pop,                              
##      bmi=low_BMI,                              
##      gdp=low_GDP,                              
##      alcohol=low_alc}                         2
## [3] {Country=Afghanistan,                      
##      Year=2013,                                
##      Status=Developing,                        
##      region=Asia,                              
##      Life=middle LE,                           
##      infantmort=low_infant,                    
##      expenditure=med_expenditure,              
##      pop=low_pop,                              
##      bmi=low_BMI,                              
##      gdp=low_GDP,                              
##      alcohol=low_alc}                         3
ars <- apriori(b, parameter = list(support=0.30, confidence=0.75))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.75    0.1    1 none FALSE            TRUE       5     0.3      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 881 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[236 item(s), 2938 transaction(s)] done [0.00s].
## sorting and recoding items ... [11 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [222 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
#222 rules created
#find rules generated from maximal/closed itemsets: 
#maximal itemsets
inspect(head(subset(ars, subset=is.maximal(ars))))
##     lhs                        rhs                       support confidence      lift count
## [1] {Status=Developing,                                                                    
##      bmi=med_BMI}           => {infantmort=low_infant} 0.3430905  1.0000000 1.0054757  1008
## [2] {infantmort=low_infant,                                                                
##      bmi=med_BMI}           => {Status=Developing}     0.3430905  0.8083400 0.9789377  1008
## [3] {Status=Developing,                                                                    
##      Life=high LE}          => {infantmort=low_infant} 0.3573860  1.0000000 1.0054757  1050
## [4] {Status=Developing,                                                                    
##      bmi=low_BMI,                                                                          
##      alcohol=low_alc}       => {infantmort=low_infant} 0.3012253  0.9833333 0.9887178   885
## [5] {infantmort=low_infant,                                                                
##      bmi=low_BMI,                                                                          
##      alcohol=low_alc}       => {Status=Developing}     0.3012253  0.9899329 1.1988552   885
## [6] {Status=Developing,                                                                    
##      infantmort=low_infant,                                                                
##      bmi=low_BMI}           => {alcohol=low_alc}       0.3012253  0.8171745 1.3641243   885
inspect(head(subset(ars, subset=rhs %in% "Life=low LE"), 5, by="support"))
inspect(head(subset(ars, subset=rhs %in% "Life=middle LE"), 5, by="support"))
inspect(head(subset(ars, subset=rhs %in% "Life=high LE"), 5, by="support"))
# no rules associated with these
# need to find freq itemsets to find closed itemsets:
freq.itemsets <- apriori(b, parameter=list(target="frequent itemsets", support=0.25))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##          NA    0.1    1 none FALSE            TRUE       5    0.25      1
##  maxlen            target   ext
##      10 frequent itemsets FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 734 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[236 item(s), 2938 transaction(s)] done [0.00s].
## sorting and recoding items ... [14 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.00s].
## writing ... [163 set(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
closed = freq.itemsets[is.closed(freq.itemsets)]
inspect(head(subset(freq.itemsets, subset=is.closed(freq.itemsets))))
##     items                         support   count
## [1] {region=Asia}                 0.2559564  752 
## [2] {Life=middle LE}              0.3801906 1117 
## [3] {bmi=low_BMI}                 0.3982301 1170 
## [4] {expenditure=low_expenditure} 0.5125936 1506 
## [5] {alcohol=low_alc}             0.5990470 1760 
## [6] {pop=low_pop}                 0.7763785 2281
#find maximal itemsets
maximal = freq.itemsets[is.maximal(freq.itemsets)]
summary(maximal)
## set of 17 itemsets
## 
## most frequent items:
## infantmort=low_infant     Status=Developing           gdp=low_GDP 
##                    15                     9                     9 
##           pop=low_pop          Life=high LE               (Other) 
##                     7                     4                    20 
## 
## element (itemset/transaction) length distribution:sizes
## 2 3 4 5 6 
## 3 4 5 4 1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   4.000   3.765   5.000   6.000 
## 
## summary of quality measures:
##     support           count       
##  Min.   :0.2505   Min.   : 736.0  
##  1st Qu.:0.2512   1st Qu.: 738.0  
##  Median :0.2828   Median : 831.0  
##  Mean   :0.2791   Mean   : 820.1  
##  3rd Qu.:0.2941   3rd Qu.: 864.0  
##  Max.   :0.3485   Max.   :1024.0  
## 
## includes transaction ID lists: FALSE 
## 
## mining info:
##  data ntransactions support confidence
##     b          2938    0.25          1
#check shorter rules
inspect(head(subset(ars, subset = size(lhs) < 5 & size(lhs) > 1), 5, by="support"))
##     lhs                        rhs                       support confidence     lift count
## [1] {pop=low_pop,                                                                         
##      gdp=low_GDP}           => {infantmort=low_infant} 0.7307692  0.9949027 1.000350  2147
## [2] {infantmort=low_infant,                                                               
##      pop=low_pop}           => {gdp=low_GDP}           0.7307692  0.9458150 1.176961  2147
## [3] {infantmort=low_infant,                                                               
##      gdp=low_GDP}           => {pop=low_pop}           0.7307692  0.9155650 1.179277  2147
## [4] {Status=Developing,                                                                   
##      gdp=low_GDP}           => {infantmort=low_infant} 0.6773315  0.9920239 0.997456  1990
## [5] {Status=Developing,                                                                   
##      infantmort=low_infant} => {gdp=low_GDP}           0.6773315  0.8257261 1.027524  1990
#note the above rules have high support and confidence but low lift.
inspect(head(subset(ars, subset = size(lhs) < 5 & size(lhs) > 1 & lift > 1), 5, by="support"))
##     lhs                        rhs                       support confidence     lift count
## [1] {pop=low_pop,                                                                         
##      gdp=low_GDP}           => {infantmort=low_infant} 0.7307692  0.9949027 1.000350  2147
## [2] {infantmort=low_infant,                                                               
##      pop=low_pop}           => {gdp=low_GDP}           0.7307692  0.9458150 1.176961  2147
## [3] {infantmort=low_infant,                                                               
##      gdp=low_GDP}           => {pop=low_pop}           0.7307692  0.9155650 1.179277  2147
## [4] {Status=Developing,                                                                   
##      infantmort=low_infant} => {gdp=low_GDP}           0.6773315  0.8257261 1.027524  1990
## [5] {infantmort=low_infant,                                                               
##      gdp=low_GDP}           => {Status=Developing}     0.6773315  0.8486141 1.027712  1990

Plotting rules by confidence and support

plot(ars, engine = "htmlwidget", jitter = 0)
#grouped 
somerules <- subset(ars, subset = size(lhs) > 1 & confidence > 0.90 & support > 0.3)
plot(somerules, method="grouped")

plot(somerules, method="graph", engine="htmlwidget")

Discuss a couple of interesting rules mined.

The association rules produced from our dataset have quite low lift, meaning the antecedents add little information beyond the base frequency of the consequents. Where the Boston dataset produced rules with lift over 2, our dataset only reaches a maximum lift of about 1.4.

In addition, our main variable of interest, life expectancy, has very few rules with higher support. When it does show up (with low support), it is associated with low infant mortality, low GDP, and developing-country status. However, the stronger rules below are more associated with infant mortality rates than with life expectancy.


Rule 1 {Status=Developing, infantmort=low_infant, expenditure=low_expenditure} => {alcohol=low_alc}
support: 0.385
confidence: 0.811
lift: 1.35

Rule 2 {Status=Developing, Life=middle LE, infantmort=low_infant, gdp=low_GDP} => {pop=low_pop}
support: 0.307
confidence: 0.991
lift: 1.28

Rule 3 {Status=Developing, infantmort=low_infant, bmi=low_BMI} => {pop=low_pop}
support: 0.302
confidence: 0.818
lift: 1.05
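The rules above were picked out by hand from the earlier inspections; a minimal sketch of one way to surface candidates like them, subsetting on an antecedent of interest and sorting by lift (the lift threshold of 1.2 is illustrative, not from the original analysis):

```r
# Keep rules with Status=Developing on the lhs and above-baseline lift,
# then inspect the strongest few by lift.
candidates <- subset(ars, subset = lhs %in% "Status=Developing" & lift > 1.2)
inspect(head(candidates, 5, by = "lift"))
```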